ABSTRACT

The Prokaryotic Operon DataBase (ProOpDB,
http://operons.ibt.unam.mx/OperonPredictor) constitutes
one of the most precise and complete repositories 
of operon predictions now available. Using our novel 
and highly accurate operon identification algorithm, 
we have predicted the operon structures of more 
than 1200 prokaryotic genomes. ProOpDB offers 
diverse alternatives by which a set of operon predic- 
tions can be retrieved including: (i) organism name, 
(ii) metabolic pathways, as defined by the KEGG 
database, (iii) gene orthology, as defined by the 
COG database, (iv) conserved protein domains, as 
defined by the Pfam database, (v) reference gene 
and (vi) reference operon, among others. To limit
the operon output to non-redundant organisms,
ProOpDB offers an efficient method to
select the most representative organisms based
on a precompiled phylogenetic distance matrix. In
addition, the ProOpDB operon predictions are used 
directly as the input data of our Gene Context Tool 
to visualize their genomic context and retrieve the 
sequence of their corresponding 5' regulatory 
regions, as well as the nucleotide or amino acid se- 
quences of their genes. 

INTRODUCTION 

Recent developments in sequencing methodologies have 
tremendously increased the repertory and size of genome 
sequence databases. More than 1200 prokaryotic and eu- 
karyotic genomes have been completely sequenced now, 
and the sequences of many others are close to being 
finished (http://www.ncbi.nlm.nih.gov/genomes/static/ 
gpstat.html). Moreover, this trend is expected to 
continue as new and more efficient sequencing techniques 
are developed. In this scenario, it becomes essential to 
develop new and better predictive tools for characterizing 
the properties of sequenced genomes. One such
property that has been the subject of bioinformatics
studies is the tendency to coordinate the expression of
metabolically or functionally related genes. In prokaryotic 
genomes, these genes are commonly found contiguously 
arranged on the same transcriptional strand and are 
co-transcribed in the same transcription units, called 
operons. Operons are the basis for determining higher-level
structures of genomic organization as well as different
cellular functions, providing important insights for
experimental design. Consequently, diverse computational
methods for the identification of operons have been de- 
veloped and used to predict operons in model organisms, 
such as Escherichia coli (1) or Bacillus subtilis (2) or in the 
fast growing set of fully sequenced genomes. As a result of 
this work, important databases with operon predictions in 
prokaryotic genomes have been developed and are 
publicly available. The strengths and characteristics of 
each of these different databases vary from one to 
another. For example, DOOR (Database of prOkaryotic 
OpeRons) (3) offers diverse querying methods to find par- 
ticular operons, including those with RNA genes. In 
addition, this database provides similarity scores 
between operons by which related operons in different 
organisms can be retrieved. DOOR can also identify 
over-represented sequence motifs in regulatory regions 
of the selected operons using MEME (4) or CUBIC, a 
motif identification program developed by the authors. 
A second database is MicrobesOnline (5) which is one of 
the most complete databases designed to integrate func- 
tional genomic data with comparative genome analyses. In 
order to accomplish this goal, MicrobesOnline has two 
main approaches, the phylogenetic approach, including a 
tree-based browser and tools for users to build their own 
trees, and the functional approach, including a wide set of 
tools to analyze microarray gene expression data and find 
genes that are co-expressed or have a particular expression 
profile. MicrobesOnline also provides tools to identify 
conserved regulatory motifs. Another publicly
available database is OperonDB (6), which has the most up-to-date
list of operon predictions, covering 1059 bacterial
genomes. Finally, ODB (Operon DataBase) (7) aims to 
collect operons that have been experimentally determined 
or are conserved in different organisms to define a set of 
reference operons. In this database, operon predictions in 
a genome are accomplished by mapping orthologous genes
onto the set of reference operons. This database provides
graphical capabilities to inspect the gene context of the 
selected operons. 

The operon prediction accuracies of the algorithms used
in the aforementioned databases vary from ~80% [in the cases
of OperonDB (6) and MicrobesOnline (5,8)], to ~90% [in 
the case of the DOOR database (3,9)], when the predic- 
tions are made on model organisms, such as E. coli or B. 
subtilis, and the training and testing datasets are from the 
same organisms. Nevertheless, accuracy commonly drops
substantially when the training and testing datasets belong to
different organisms. In this regard, our Prokaryotic
Operon DataBase (ProOpDB) uses a novel operon prediction
algorithm with one of the highest accuracy levels ever
reported (10), regardless of the source of the training and testing
datasets. In addition, ProOpDB offers diverse alternatives
to retrieve a specific set of operons, making it unique of its
kind. ProOpDB is one of the most precise and complete
repositories of operon predictions now available. It 
includes >1200 prokaryotic genomes and a total of 
2 549 412 predicted operons, including RNA genes. 

MAIN ATTRIBUTES OF ProOpDB 

As previously mentioned, there are several kinds of 
operon databases that offer different advantages to the 
users. The main attributes of ProOpDB are as follows: 

ProOpDB contains the most accurate bacterial operon 
predictions 

The accuracy of the operon predictions in ProOpDB is one of
the most important characteristics of our database. These
predictions were generated by our recently published 
operon identification neural network method, which was 
successfully tested on the set of experimentally defined 
operons of E. coli and B. subtilis, with accuracies of 
94.6% and 93.3%, respectively (10). As far as we know, 
these are the highest accuracies ever obtained when pre- 
dicting bacterial operons. Furthermore, one fundamental 
advantage of ProOpDB over other operon databases is 
that the performance of the algorithm used to predict 
operons remains outstandingly high even when the 
training and testing organisms are not the same. For 
example, when our algorithm was trained with B. subtilis 
data to predict E. coli operons the accuracy obtained was 
91.5%, and when the training procedure was done with 
E. coli data to predict B. subtilis operons, the accuracy was 
93% (10). We consider these accuracies to be notably
high, especially taking into account that the
highest accuracy previously reported in a similar analysis
was only 83% (9). Furthermore, to evaluate the performance
of our operon prediction method in organisms
other than E. coli and B. subtilis, we tested it on a set of
202 experimentally determined operons compiled in the 
ODB database (7) that includes 433 operonic-gene pairs 
in 50 partially sequenced genomes. The accuracy reached 
by our method on this dataset was 92.4%. In addition, we
also tested our method on a set of 1 145 operonic gene pairs
of 522 predicted operons from a genome-wide
transcriptional study (11). In this case, the accuracy was
also very high, 91.3%. These results show the potential of
our method to accurately predict the operons of any other
newly sequenced organism. This generalization capability
is achieved because our operon predictions are based on the
functional relationships of contiguous genes defined by the 
STRING database (12) that integrates the information 
from distinct kinds of sources of different organisms, 
such as gene neighborhood, gene fusion, gene 
co-occurrence, gene co-expression and protein-protein 
interactions. 
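
The accuracies quoted above are computed over pairs of adjacent genes that are labeled as operonic or non-operonic. As a minimal sketch of such an evaluation, assuming that accuracy means the fraction of scored gene pairs whose predicted label matches the experimental one (the function name `pair_accuracy` and the toy gene identifiers are hypothetical and are not part of ProOpDB), one could write:

```python
def pair_accuracy(predicted, observed):
    """Fraction of gene pairs whose predicted label matches the observed one.

    Both arguments map a gene pair (gene_a, gene_b) to True (same operon)
    or False (transcription-unit boundary). Only pairs that are present in
    both dictionaries are scored.
    """
    scored = [pair for pair in observed if pair in predicted]
    if not scored:
        raise ValueError("no gene pairs in common")
    correct = sum(predicted[pair] == observed[pair] for pair in scored)
    return correct / len(scored)


# Hypothetical toy data, not taken from ProOpDB or any real genome.
observed = {
    ("g1", "g2"): True,
    ("g2", "g3"): True,
    ("g3", "g4"): False,
    ("g4", "g5"): True,
}
predicted = {
    ("g1", "g2"): True,
    ("g2", "g3"): False,  # the only misclassified pair
    ("g3", "g4"): False,
    ("g4", "g5"): True,
}
accuracy = pair_accuracy(predicted, observed)
```

In a cross-organism test of the kind described above, `predicted` would come from a model trained on one organism and `observed` from the experimentally defined operons of another.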

Retrieval of operons in ProOpDB can be based on 
metabolic pathways 

In addition to the commonly used operon search criteria, such as
gene name or gene ID in specific genomes, ProOpDB
allows operon retrieval and visualization by specific metabolic
pathways or cellular processes as defined by the
well-known KEGG database (Kyoto Encyclopedia of 
Genes and Genomes) (13). This retrieval mode based on 
metabolic pathways allows any user to perform the 
operon analysis of an organism, or group of organisms, 
from a more integrated point of view in accordance with 
their cellular and biochemical knowledge. For instance, if 
a user is interested in knowing the operons of E. coli K-12 
MG1655 involved in chemotaxis, he/she can simply 
specify the KEGG pathway 02030 to find out that, in 
this organism, there are 20 genes clustered into 9 
operons that are involved in this cellular process. 
Furthermore, with ProOpDB any user can easily identify 
regulatory motifs, not only related to a specific gene 
family, but also to a particular metabolic pathway. For 
example, if a user is interested in identifying regulatory 
elements of genes related to the thiamine metabolism, it 
would be easy to retrieve the intergenic sequences of 
operons encoding genes belonging to this pathway, 
which corresponds to the KEGG pathway 00730. Once
the sequences have been retrieved, they can be used
directly as input to the motif identification
program MEME (4), which has been locally installed on
our web server, or as input to any other motif
identification web server. After this analysis, the user would
be able to identify the conserved sequence motif of the 
Thi-box riboswitch (14) (Figure 1). 

ProOpDB allows the operon retrieval and visualization 
based on COGs groups 

A frequent approach when looking for related operons in 
databases, such as DOOR (3), is to consider the orthology 
relationships of genes based on the Clusters of Orthologous
Groups of the COG database (16). However, since only a
limited set of organisms has been annotated in the COG
database, the recovery
of genes by COGs can be highly restricted to certain
genomes. To overcome this problem, we have assigned
to each gene a corresponding COG group using Hidden
Markov Models and the HMMER program (17), so that
operons in ProOpDB can be efficiently retrieved by this
criterion.
Figure 1. Operon structures of genes participating in the thiamine metabolism pathway (KEGG 00730) in different organisms. Among the diverse
alternatives offered by ProOpDB, the selection based on KEGG pathways allows the comparison of the different transcription units that belong to a
specific metabolic process in different organisms. (a) The great diversity of operon organization involved in thiamine metabolism can be
observed. It is important to note that genes in the ProOpDB output are colored according to the feature (COG for phylogenetic, KEGG for
metabolic or Pfam for conserved protein domains) that was used in the operon retrieval process. In our example, the genes are colored based on the
KEGG pathway annotations; thus, the potential relationships between metabolic pathways can be inferred. For example, genes that belong to
thiamine metabolism (KEGG 00730, yellow) are part of operons co-transcribing genes of the sulfur relay system (KEGG 04122, red) and
genes of purine metabolism (KEGG 00230, orange) in Aquifex aeolicus (Aquificae), Corynebacterium diphtheriae gravis
(Actinobacteria) and E. coli K-12 MG1655 (Proteobacteria). (b) The 5' and 3' regulatory sequences of the operon, as well as the protein and
nucleotide sequences of the genes, can be retrieved for specific analyses by particular user programs. (c) Finger-print analyses can be performed
using the programs locally installed on the ProOpDB web server. Redundant sequences are eliminated using the CD-HIT program (15) prior to the
analysis of over-represented motifs using the MEME program (4).



Operons in ProOpDB can be selected based on the 
conserved domains of their proteins 

The conserved domains defined in Pfam-A (18) of
each gene in ProOpDB have been annotated using 
the hmmpfam program of the HMMER package (17) 
so that operons may also be retrieved by a given 
conserved domain of their corresponding genes. For 
example, if a user is interested in finding the structure of 
the operons encoding regulatory proteins with the helix- 
turn-helix domain of the LysR family (Pfam HTH_1), he/ 
she will easily find that almost all of them are 
monocistronic. In another instance, if the user needs to 
know the structure of operons carrying the RelB toxin- 
antitoxin system, using as an input the name of the Pfam 
family relB, he/she will discover that it corresponds to
bi-cistronic operons encoding proteins with the conserved 
domains of RelB and the plasmid stabilization system 
protein. 



Selection of operons in ProOpDB can be made on the 
basis of a reference gene or a reference operon 

An important feature of ProOpDB is its ability to show
the structures of the operons that contain a gene, or a
family of genes, of particular interest. To this end, we
have determined the orthology relationships of genes by the
bi-directional best hit criterion using BLAST searches; thus,
all the operons in a selected set of organisms that contain
the orthologs of a reference gene will be displayed.
ProOpDB can also perform a similar task using a refer- 
ence operon as input. In this case, all the orthologs of any 
of the genes of the reference operon will be considered. 
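
The bi-directional best hit criterion mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the similarity scores are taken to be already parsed from BLAST output into dictionaries, the best hit is the one with the highest score, and the names `best_hits` and `bidirectional_best_hits` as well as the toy scores are hypothetical rather than part of the ProOpDB code.

```python
def best_hits(scores):
    """Return {query: best_subject} from {(query, subject): score}.

    When two subjects tie, the one seen first is kept (an arbitrary choice).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical scores between proteins of genome A and genome B.
a_vs_b = {
    ("a1", "b1"): 250.0,
    ("a1", "b2"): 90.0,
    ("a2", "b2"): 180.0,
    ("a3", "b2"): 120.0,
}
b_vs_a = {
    ("b1", "a1"): 245.0,
    ("b2", "a2"): 175.0,
    ("b2", "a3"): 118.0,
}
orthologs = bidirectional_best_hits(a_vs_b, b_vs_a)
```

Here `a3` has `b2` as its best hit, but the best hit of `b2` is `a2`, so only the reciprocal pairs are reported as likely orthologs.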

In ProOpDB the nucleotide or protein sequences
associated with selected operons can be easily retrieved

ProOpDB also offers the possibility to download the set of 
5' intergenic sequences of the selected operons. These sequences
can be further used to perform fingerprint
analysis to identify regulatory motifs or any other kind of
analysis. In addition, it is also possible to retrieve the
amino acid sequences of the proteins encoded by the operons.
These sequences are exported as flat files in Fasta format. 
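
To make the retrieval of 5' regions concrete, the sketch below extracts the sequence immediately upstream of an operon and writes it in Fasta format. It is an illustration only: it assumes 0-based, half-open coordinates on the forward strand and a fixed-length upstream window (`max_len`), whereas ProOpDB reports intergenic regions; all function names and the toy genome are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def upstream_region(genome, start, end, strand, max_len=200):
    """Return up to max_len bases located 5' of a feature.

    start and end are 0-based, half-open forward-strand coordinates.
    For a feature on the minus strand, the 5' side lies after `end`
    and the sequence is reverse complemented.
    """
    if strand == "+":
        return genome[max(0, start - max_len):start]
    if strand == "-":
        return reverse_complement(genome[end:end + max_len])
    raise ValueError(f"unknown strand: {strand!r}")


def to_fasta(records, width=60):
    """Format (header, sequence) pairs as Fasta text."""
    lines = []
    for header, seq in records:
        lines.append(f">{header}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Hypothetical 12-base genome with one operon on each strand.
genome = "AACCGGTTACGT"
plus = upstream_region(genome, 4, 8, "+", max_len=3)
minus = upstream_region(genome, 4, 8, "-", max_len=3)
fasta_text = to_fasta([("operon_1_upstream", plus), ("operon_2_upstream", minus)])
```

A file produced this way can be passed to a motif discovery program such as MEME.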

The selection of organisms in ProOpDB can be done 
based on their taxonomic order or by their phylogenetic 
distances 

As the number of available sequenced genomes increases, 
the need for an efficient protocol to select non-redundant
organisms from a specific taxon or different taxa becomes 
an important issue. For example, in the set of fully 
sequenced genomes, there are more than 30 E. coli 
strains. In this regard, we have evaluated the phylogenetic 
distance between every pair of organisms in ProOpDB, 
based on the sequence alignment of the gene concaten- 
ation of 31 orthologs present in 191 sequenced genomes 
that have been selected according to Ciccarelli et al. (19). 
This phylogenetic distance information is used by
ProOpDB to select the least redundant set of organisms
from the list of all possible organisms in a given taxon or
taxa chosen by the user. This property of ProOpDB is
particularly useful for identifying regulatory sites in a 
finger-print analysis. For example, if the user is interested 
in identifying, in the Gamma-proteobacteria, the tryptophan
repressor binding site that is used for its
autoregulation, he/she could restrict the analysis to this
particular phylogenetic group. A search in ProOpDB
using 'trpR' as keyword will return 95 genes with
this name. To avoid the inclusion of many closely
related E. coli strains (e.g. E. coli O127, E. coli 55989, E. coli
APEC, E. coli BW2952, etc.), the user could ask for a
smaller number of organisms, for example 30. Using a
pre-compiled matrix of phylogenetic distances between organisms,
ProOpDB will select the least redundant set of
30 Gamma-proteobacteria organisms, from which the
trp regulatory regions can be obtained for use in the
regulatory finger-print analysis.
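
The exact selection procedure used by ProOpDB is not detailed here. One plausible way to pick a small, non-redundant subset from a pre-compiled distance matrix is a greedy farthest-point heuristic, sketched below as an assumption rather than as the ProOpDB implementation; the function name, the organism labels and the distances are all hypothetical.

```python
def select_least_redundant(names, matrix, k):
    """Greedily pick k organisms that are far apart from each other.

    matrix[i][j] is the phylogenetic distance between names[i] and names[j].
    The first pick is the organism with the largest total distance to all
    others; each later pick maximizes its distance to the nearest organism
    that has already been chosen.
    """
    n = len(names)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of organisms")
    first = max(range(n), key=lambda i: sum(matrix[i]))
    chosen = [first]
    nearest = list(matrix[first])  # distance of each organism to the chosen set
    while len(chosen) < k:
        candidate = max(
            (i for i in range(n) if i not in chosen),
            key=lambda i: nearest[i],
        )
        chosen.append(candidate)
        nearest = [min(nearest[i], matrix[candidate][i]) for i in range(n)]
    return [names[i] for i in chosen]


# Hypothetical distances: two nearly identical strains and two other species.
names = ["strain_A1", "strain_A2", "species_B", "species_C"]
matrix = [
    [0.00, 0.01, 0.30, 0.60],
    [0.01, 0.00, 0.31, 0.61],
    [0.30, 0.31, 0.00, 0.50],
    [0.60, 0.61, 0.50, 0.00],
]
selected = select_least_redundant(names, matrix, 3)
```

With these toy distances only one of the two closely related strains is retained, which mirrors the intended effect of avoiding many nearly identical E. coli genomes in a finger-print analysis.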

IMPLEMENTATION AND WEB INTERFACE 

ProOpDB is organized in five modules. The first module, 
data acquisition, is dedicated to retrieving primary infor- 
mation from the KEGG flat files database (13), including 
the genomic sequence of organisms and information of 
their corresponding genes, such as their names, functions 
and their corresponding metabolic pathways. In the 
second module, data analysis, every gene in ProOpDB is 
annotated with its corresponding COG (16) or ROC (10) 
groups using our Hidden Markov Models and the 
hmmsearch program of the HMMER package (17). In 
addition, the exact positions of conserved protein
domains defined by the Pfam-A models (18) were determined
using the hmmpfam program of the HMMER package 
(17). In this module, BLAST comparisons (20) are per- 
formed to identify bi-directional best hits (BDBH) 
among proteins of the same COG to establish likely 
orthology relationships. In the third module, operon predictions,
the operonic or non-operonic nature of every pair
of genes is predicted using our recently developed program
(10) based on an artificial neural network. The input variables
of this neural network are intergenic distances and
functional relationships between the protein products of
contiguous genes, as defined by the STRING database (12).
Afterward, the structures of all the operons are determined.
In the fourth module, database management, all the 
acquired or generated information is stored using a relational
database management system (MySQL,
http://www.mysql.com/). The fifth module, web server service,
comprises the subroutines and modules required for the
accessibility and display capabilities of our web interface. In
this module, the names of the genes that were obtained as
a result of the operon prediction analysis are sent to our
web server Gene Context Tool (GeConT) for their graphic
representation. GeConT consists of a collection of Perl-CGI
programs that use the open-source GD graphics
library and JavaScript code to create HTML files.
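
The step of the third module that turns pairwise predictions into operon structures can be sketched as follows. This is a simplified illustration, not the published program: the pairwise classifier is replaced by a lookup in a hypothetical set of gene pairs, genes are assumed to be listed in chromosomal order, adjacent genes on opposite strands are never joined, and the wrap-around of a circular chromosome is ignored.

```python
def assemble_operons(genes, is_operonic_pair):
    """Group ordered genes into operons from pairwise predictions.

    genes: list of (gene_id, strand) tuples in chromosomal order.
    is_operonic_pair: callable taking two adjacent gene ids and returning
    True when the pair is predicted to be co-transcribed.
    """
    operons = []
    previous = None
    for gene_id, strand in genes:
        if (
            previous is not None
            and previous[1] == strand
            and is_operonic_pair(previous[0], gene_id)
        ):
            operons[-1].append(gene_id)  # extend the current operon
        else:
            operons.append([gene_id])  # start a new transcription unit
        previous = (gene_id, strand)
    return operons


# Hypothetical pairwise predictions standing in for the neural network output.
operonic_pairs = {("g1", "g2"), ("g4", "g5")}
genes = [("g1", "+"), ("g2", "+"), ("g3", "+"), ("g4", "-"), ("g5", "-")]
operons = assemble_operons(genes, lambda a, b: (a, b) in operonic_pairs)
```

In this toy example `g3` forms a monocistronic unit because neither of its neighboring pairs is predicted to be operonic.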

FUTURE PLANS 

The next update of ProOpDB will have a fully automatic 
update module, so that all of the above-mentioned 
modules will run automatically. This enhancement will
ensure that ProOpDB is always up to date.